The Covid-19 pandemic has had a devastating impact on the world in recent years. It is much more than a health threat and has affected every individual in one way or another. With lockdowns and emergencies in all parts of the world, it has changed the way everything used to function. The economic and social disruption caused by the pandemic is huge and has caused crises even in well-developed countries. Millions of people lost their jobs and are at risk of extreme poverty. Many enterprises faced an existential threat. All of this can be repaired in the future. What we cannot change is the dramatic loss of human life worldwide and the effect the pandemic has had on people's health.
The best way to really understand how the coronavirus pandemic has affected the world is to look at the statistics. Let us start by analyzing its spread and the number of deaths it caused.
# Importing the libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import widgets, interactive
import plotly.io as pio
import plotly.express as px
pio.renderers.default='notebook'
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
# Reading the data
covid_df = pd.read_csv("./data/owid-covid-data.csv.gz", compression="gzip")
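If the date column is needed as real timestamps later on (for example for time-based filtering), it can be parsed while reading. The following is a minimal sketch and not the loading code used in this analysis: it builds a tiny gzip-compressed CSV in memory, with made-up placeholder rows, and reads it with the same compression argument plus parse_dates.

```python
import gzip
import io

import pandas as pd

# Made-up placeholder rows that only mimic the layout of the real file
csv_text = (
    "iso_code,location,date,new_cases\n"
    "AAA,Country A,2021-01-01,10\n"
    "AAA,Country A,2021-01-02,12\n"
)

# Compress the text in memory so the example does not need a file on disk
raw = gzip.compress(csv_text.encode("utf-8"))

# Same call shape as above, with parse_dates added to get datetime64 values
sample_df = pd.read_csv(io.BytesIO(raw), compression="gzip", parse_dates=["date"])
print(sample_df.dtypes)
```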
A lot of data in the dataset is missing, so we drop the rows that have missing values in the columns we need throughout the analysis.
required_columns = ["iso_code", "location", "continent", "date"]
covid_df = covid_df.dropna(subset = required_columns)
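To see what the subset argument does, here is a small sketch on a toy frame with made-up values. Only rows that miss a value in one of the listed columns are dropped; a row that misses a value elsewhere is kept.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; None/NaN marks a missing entry
toy = pd.DataFrame({
    "iso_code": ["AAA", "BBB", None],
    "location": ["Country A", "Country B", "Country C"],
    "new_cases": [5.0, np.nan, 7.0],
})

# Only rows missing a value in the listed columns are dropped
cleaned = toy.dropna(subset=["iso_code", "location"])
print(cleaned)
```

Country B stays in the result even though its new_cases value is missing, which is why later steps in this analysis drop missing values again for the specific column they plot.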
Below is a random sample of five rows from the data. It shows some of the relevant columns available for each country:
# Visualizing the data
covid_df.sample(5)
| | Unnamed: 0 | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | ... | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59448 | 59448 | GIB | Europe | Gibraltar | 2021-08-10 | 5154.0 | 18.0 | 17.571 | 95.0 | 0.0 | ... | NaN | NaN | NaN | NaN | 79.93 | NaN | NaN | NaN | NaN | NaN |
| 85244 | 85244 | LSO | Africa | Lesotho | 2021-02-08 | 9380.0 | 0.0 | 68.571 | 183.0 | 0.0 | ... | 0.4 | 53.9 | 2.117 | NaN | 54.33 | 0.527 | NaN | NaN | NaN | NaN |
| 11519 | 11519 | BHS | North America | Bahamas | 2021-04-01 | 9171.0 | 52.0 | 33.714 | 188.0 | 0.0 | ... | 3.1 | 20.4 | NaN | 2.90 | 73.92 | 0.814 | NaN | NaN | NaN | NaN |
| 124999 | 124999 | RWA | Africa | Rwanda | 2021-12-24 | 103799.0 | 629.0 | 373.714 | 1345.0 | 0.0 | ... | 4.7 | 21.0 | 4.617 | NaN | 69.02 | 0.543 | NaN | NaN | NaN | NaN |
| 73845 | 73845 | IRL | Europe | Ireland | 2021-08-04 | 305527.0 | 1217.0 | 1262.857 | 5044.0 | 9.0 | ... | 23.0 | 25.7 | NaN | 2.96 | 82.30 | 0.955 | NaN | NaN | NaN | NaN |
5 rows × 68 columns
The following plot shows the number of new cases per day in each country, using the smoothed daily values (the new_cases_smoothed column). Drag the slider on the timeline to see the plot for a particular day.
Around April 2021, India shows the steepest rise in cases. India was going through its second wave of the pandemic at that time and therefore recorded the most new cases per day.
tmp_df = covid_df.dropna(subset=['new_cases_smoothed'])
fig = px.scatter_geo(tmp_df, locations="iso_code", color="continent",
                     hover_name="location", size="new_cases_smoothed",
                     projection="natural earth", animation_frame="date", template="seaborn")
fig.show()
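The observation about a particular day can also be checked directly on the data instead of reading it off the map. The sketch below uses made-up numbers and a hypothetical helper, top_location_on, which is not part of the original analysis. It assumes dates are stored as strings, as they are when the file is read without date parsing.

```python
import pandas as pd

# Made-up numbers, used only to show the lookup
toy = pd.DataFrame({
    "location": ["Country A", "Country B", "Country A", "Country B"],
    "date": ["2021-04-01", "2021-04-01", "2021-04-02", "2021-04-02"],
    "new_cases_smoothed": [120.0, 300.0, 150.0, 90.0],
})

def top_location_on(df, day):
    """Return the location with the highest smoothed new cases on `day`."""
    on_day = df[df["date"] == day]
    return on_day.loc[on_day["new_cases_smoothed"].idxmax(), "location"]

print(top_location_on(toy, "2021-04-01"))
print(top_location_on(toy, "2021-04-02"))
```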
The next plot shows the total number of cases over time for each country. Drag the slider on the timeline to see the rise in cases around the world.
tmp_df = covid_df.dropna(subset=['total_cases'])
fig = px.scatter_geo(tmp_df, locations="iso_code", color="continent",
                     hover_name="location", size="total_cases",
                     projection="natural earth", animation_frame="date", template="seaborn")
fig.show()
Now let us visualize the number of deaths due to Covid-19.
We first drop the rows that have missing data in the relevant columns.
cols = ['location', 'total_deaths_per_million', 'continent', 'total_deaths']
df = covid_df.dropna(subset = cols)
df.sample(5)
| | Unnamed: 0 | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | ... | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99222 | 99222 | MDA | Europe | Moldova | 2020-08-09 | 27660.0 | 217.0 | 328.286 | 845.0 | 4.0 | ... | 5.9 | 44.6 | 86.979 | 5.8 | 71.90 | 0.750 | NaN | NaN | NaN | NaN |
| 124573 | 124573 | RWA | Africa | Rwanda | 2020-10-24 | 5060.0 | 8.0 | 12.714 | 34.0 | 0.0 | ... | 4.7 | 21.0 | 4.617 | NaN | 69.02 | 0.543 | NaN | NaN | NaN | NaN |
| 163126 | 163126 | WLF | Oceania | Wallis and Futuna | 2021-12-22 | 454.0 | 0.0 | 0.000 | 7.0 | 0.0 | ... | NaN | NaN | NaN | NaN | 79.94 | NaN | NaN | NaN | NaN | NaN |
| 13362 | 13362 | BRB | North America | Barbados | 2020-04-19 | 75.0 | 0.0 | 0.571 | 5.0 | 0.0 | ... | 1.9 | 14.5 | 88.469 | 5.8 | 79.19 | 0.814 | NaN | NaN | NaN | NaN |
| 11809 | 11809 | BHS | North America | Bahamas | 2022-01-16 | 30850.0 | 195.0 | 318.714 | 719.0 | 0.0 | ... | 3.1 | 20.4 | NaN | 2.9 | 73.92 | 0.814 | NaN | NaN | NaN | NaN |
5 rows × 68 columns
The following plot shows the total number of deaths in each country over time. The x-axis represents the date and the y-axis represents the cumulative number of deaths.
To view the plot for a single country, double-click that country in the legend. To compare countries, single-click each additional country you wish to add. To go back to the plots for all countries, double-click any unselected country.
Hovering over the graph shows the exact number of deaths in a particular country up to that date.
# Total Deaths for each country
fig = px.line(df, x="date", y="total_deaths", title='Total Deaths', hover_name='location', color='location')
fig.show()
Does the above plot mean that the US, Brazil, and India are the countries worst affected by Covid? No! Each country has a different population, so we should not compare countries by total deaths. A fairer measure is deaths per million people. A simple example makes this clear.
Below is a similar plot, but instead of the total deaths it shows the total deaths per million. If you select India and Georgia in the plot above, you can see that the total deaths in India (514k) are far more than in Georgia (16k). Now select the same two countries in the plot below. The total deaths per million for India are around 350, whereas for Georgia they are around 4000!
This is a misconception that a lot of people have. As a result, small countries like Georgia don't make it into the news even when they are more severely affected, and they don't get the attention and help they should.
# Total Deaths per million for all countries
fig = px.line(df, x="date", y="total_deaths_per_million", title='Total Deaths per million', hover_name='location', color='location')
fig.show()
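The normalization behind the per-million figures is simple arithmetic, sketched below. The helper name deaths_per_million is hypothetical, and the population figures are rough assumptions for illustration rather than values taken from the dataset. The death totals are the approximate numbers quoted above.

```python
def deaths_per_million(total_deaths, population):
    """Scale a death count to a rate per one million inhabitants."""
    return total_deaths / population * 1_000_000

# Death totals are the approximate values quoted in the text.
# Populations are rough assumptions for illustration, not dataset values.
india = deaths_per_million(514_000, 1_390_000_000)
georgia = deaths_per_million(16_000, 4_000_000)

print(round(india), round(georgia))
```

Even with these rough inputs, Georgia's rate comes out roughly ten times higher than India's, which matches what the two plots show.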
To get better insight, we should view the statistics for each continent separately, as each of them has different resources. Below are the plots for the 5 countries with the most and the 5 countries with the fewest deaths per million in each continent.
def death_per_continent(df, continent, top=5, bottom=5):
    # Keep only the rows for the requested continent
    df_continent = df[df['continent'] == continent]
    # Rank countries by their highest recorded deaths per million
    ranked = df_continent.sort_values('total_deaths_per_million', ascending=False)
    locs = list(ranked.location.unique()[:top])
    locs += list(ranked.location.unique()[-bottom:])
    # Plot the selected countries in chronological order
    df_continent = df_continent[df_continent['location'].isin(locs)].sort_values('date')
    fig = px.line(df_continent, x="date", y="total_deaths_per_million", title=f'Total Deaths Per Million in {continent}', hover_name='location', color='location')
    return fig
# One plot per continent
for continent in covid_df.continent.unique():
    fig = death_per_continent(df, continent)
    fig.show()
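The ranking inside death_per_continent relies on the order in which unique() returns locations after sorting. An alternative sketch, which is not the function's own code, makes the ranking explicit with a groupby on the highest value each country reached. The values below are made up.

```python
import pandas as pd

# Made-up cumulative values, used only to show the ranking step
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "C", "C"],
    "date": ["2021-01-01", "2021-01-02"] * 3,
    "total_deaths_per_million": [10.0, 40.0, 5.0, 8.0, 100.0, 250.0],
})

# Highest value reached by each country, sorted from most to least affected
peak = (
    toy.groupby("location")["total_deaths_per_million"]
    .max()
    .sort_values(ascending=False)
)

most_affected = list(peak.index[:1])
least_affected = list(peak.index[-1:])
print(most_affected, least_affected)
```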
To bring this long-running pandemic to an end, an efficient and inclusive distribution of Covid-19 vaccines could be our most promising next step. In order to take action along these lines, we first need to understand how the current Covid vaccination drives are running. Interpreting this data lets us identify any underlying inequalities in the distribution of Covid vaccinations. So, our task is to understand Covid-19 vaccination data worldwide and draw inferences from it about how the vaccination drives are going and about the underlying inequalities across the world.
Let us now look at the vaccination progress in three countries: one developed, one developing, and one underdeveloped.
def vacc_by_country(df, country):
    df_country = df[df['location']==country]
    fig = px.line(df_country, x='date', y=['people_vaccinated_per_hundred','people_fully_vaccinated_per_hundred','total_boosters_per_hundred'], title = f'% of people vaccinated in {country}')
    return fig
cols = ['date','location','people_vaccinated_per_hundred']
covid_df = covid_df.dropna(subset = cols)
fig = vacc_by_country(covid_df, 'Canada')
fig.show()
fig = vacc_by_country(covid_df, 'India')
fig.show()
fig = vacc_by_country(covid_df, 'Chad')
fig.show()
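To compare countries with a single number instead of three separate plots, we can take the most recent value reported for each country. This is a minimal sketch on made-up percentages for placeholder countries, not figures from the dataset.

```python
import pandas as pd

# Made-up percentages for placeholder countries
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": ["2021-06-01", "2021-12-01", "2021-06-01", "2021-12-01"],
    "people_vaccinated_per_hundred": [40.0, 80.0, 0.2, 1.5],
})

# Sort by date, then keep the last row of each country
latest = (
    toy.sort_values("date")
    .groupby("location")
    .tail(1)
    .set_index("location")["people_vaccinated_per_hundred"]
)
print(latest)
```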
Now let us look at how the number of doses of different vaccines rose in various countries. The type of vaccine, and thus the manufacturer, played an important role in the vaccination drives because vaccines differ in cost and success rate.
First we import the data. It contains the total number of doses of each vaccine over time in each country.
# Importing the data
v_by_manu = pd.read_csv("./data/vaccinations/vaccinations-by-manufacturer.csv")
v_by_manu.sample(10)
| | location | date | vaccine | total_vaccinations |
|---|---|---|---|---|
| 20978 | Peru | 2021-12-26 | Sinopharm/Beijing | 18422637 |
| 7941 | Ecuador | 2021-07-04 | Pfizer/BioNTech | 1813690 |
| 9569 | France | 2021-04-29 | Johnson&Johnson | 44082 |
| 29063 | Ukraine | 2021-12-29 | Johnson&Johnson | 20680 |
| 21660 | Portugal | 2021-05-07 | Sinovac | 803 |
| 3190 | Belgium | 2021-11-12 | Pfizer/BioNTech | 13254824 |
| 32507 | European Union | 2021-03-22 | Moderna | 3242472 |
| 1680 | Argentina | 2021-10-22 | Sputnik V | 16736291 |
| 20017 | Norway | 2021-10-22 | Johnson&Johnson | 6655 |
| 25466 | South Korea | 2021-12-25 | Novavax | 1 |
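The table above is in long format, with one row per country, date, and vaccine. For a single country it can be reshaped so that each vaccine gets its own column. The sketch below shows this with pandas pivot on made-up dose counts for a placeholder country.

```python
import pandas as pd

# Made-up dose counts for a placeholder country
toy = pd.DataFrame({
    "location": ["A"] * 4,
    "date": ["2021-03-01", "2021-03-01", "2021-04-01", "2021-04-01"],
    "vaccine": ["Vaccine X", "Vaccine Y", "Vaccine X", "Vaccine Y"],
    "total_vaccinations": [100, 50, 400, 90],
})

# One row per date, one column per vaccine
wide = toy.pivot(index="date", columns="vaccine", values="total_vaccinations")
print(wide)
```

A wide frame like this can be passed to wide.plot() to draw one line per vaccine.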
The following plot is for Argentina. The x-axis represents time and the y-axis shows the total number of doses of each vaccine administered up to that date. We can observe how the use of certain vaccines rose suddenly.
j = "Argentina"
v_arg = v_by_manu[v_by_manu["location"] == j]
# Parse the date strings so the x-axis is a true time axis
v_arg = v_arg.assign(date=pd.to_datetime(v_arg["date"]))
# Plot one line per vaccine
for i in v_arg["vaccine"].unique():
    v_arg_spu = v_arg[v_arg["vaccine"] == i]
    plt.plot(v_arg_spu["date"], v_arg_spu["total_vaccinations"], label=i)
x_ticks = pd.to_datetime(["2021-01-01", "2021-04-01", "2021-07-01", "2021-10-01", "2022-01-01"])
x_labels = ['1-21', '4-21', '7-21', '10-21', '1-22']
plt.xticks(ticks=x_ticks, labels=x_labels)
plt.legend()
plt.xlabel('Date')
plt.ylabel('Total Vaccinations till Date')
plt.title(j)
We can also create an interactive plot in the following way. When run in a Jupyter notebook, this gives us a dropdown to select the country whose vaccine doses we want to analyse.
area = widgets.Dropdown(
    options=v_by_manu["location"].unique(),
    value='Argentina',
    description='Country',
)
def plotit(area):
    v_arg = v_by_manu[v_by_manu["location"] == area]
    # Parse the date strings so the x-axis is a true time axis
    v_arg = v_arg.assign(date=pd.to_datetime(v_arg["date"]))
    # Label only the first and last date available for the country
    x_ticks = [v_arg["date"].min(), v_arg["date"].max()]
    for i in v_arg["vaccine"].unique():
        v_arg_spu = v_arg[v_arg["vaccine"] == i]
        plt.plot(v_arg_spu["date"], v_arg_spu["total_vaccinations"], label=i)
    x_labels = [t.strftime("%Y-%m-%d") for t in x_ticks]
    plt.xticks(ticks=x_ticks, labels=x_labels)
    plt.legend()
    plt.xlabel('Date')
    plt.ylabel('Total Vaccinations till Date')
    plt.title(area)
interactive(plotit, area=area)